7 research outputs found
CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction
Given the recent advances in depth prediction from Convolutional Neural
Networks (CNNs), this paper investigates how predicted depth maps from a deep
neural network can be deployed for accurate and dense monocular reconstruction.
We propose a method where CNN-predicted dense depth maps are naturally fused
together with depth measurements obtained from direct monocular SLAM. Our
fusion scheme privileges depth prediction in image locations where monocular
SLAM approaches tend to fail, e.g. along low-textured regions, and vice-versa.
We demonstrate the use of depth prediction for estimating the absolute scale of
the reconstruction, hence overcoming one of the major limitations of monocular
SLAM. Finally, we propose a framework to efficiently fuse semantic labels,
obtained from a single frame, with dense SLAM, yielding semantically coherent
scene reconstruction from a single view. Evaluation results on two benchmark
datasets show the robustness and accuracy of our approach.
Comment: 10 pages, 6 figures, IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR), Hawaii, USA, June, 2017. The first two
authors contribute equally to this paper.
Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing
Self-supervised visual representation learning has recently attracted
significant research interest. While a common way to evaluate self-supervised
representations is through transfer to various downstream tasks, we instead
investigate the problem of measuring their interpretability, i.e. understanding
the semantics encoded in raw representations. We formulate the latter as
estimating the mutual information between the representation and a space of
manually labelled concepts. To quantify this we introduce a decoding
bottleneck: information must be captured by simple predictors, mapping concepts
to clusters in representation space. This approach, which we call reverse
linear probing, provides a single number sensitive to the semanticity of the
representation. This measure is also able to detect when the representation
contains combinations of concepts (e.g., "red apple") instead of just
individual attributes ("red" and "apple" independently). Finally, we propose to
use supervised classifiers to automatically label large datasets in order to
enrich the space of concepts used for probing. We use our method to evaluate a
large number of self-supervised representations, ranking them by
interpretability, highlight the differences that emerge compared to the
standard evaluation with linear probes and discuss several qualitative
insights. Code at: https://github.com/iro-cp/ssl-qrp
Comment: Published at ICLR 2022. Appendix included, 26 pages.
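The quantity this abstract describes, the mutual information between a quantized (clustered) representation and labelled concepts, can be illustrated with a plug-in estimate from co-occurrence counts. This is only a minimal sketch and not the authors' implementation: the function name `mutual_information` and the toy labels are hypothetical, and the cluster assignments are assumed to be precomputed.

```python
import math
from collections import Counter


def mutual_information(clusters, concepts):
    """Plug-in estimate of I(cluster; concept) in bits from paired discrete labels.

    Hypothetical helper for illustration: `clusters` holds one cluster id per
    sample, `concepts` holds the manually labelled concept for the same sample.
    """
    if len(clusters) != len(concepts):
        raise ValueError("clusters and concepts must have the same length")
    n = len(clusters)
    joint = Counter(zip(clusters, concepts))   # counts of (cluster, concept) pairs
    cluster_counts = Counter(clusters)         # marginal counts per cluster
    concept_counts = Counter(concepts)         # marginal counts per concept
    mi = 0.0
    for (c, y), n_cy in joint.items():
        # p(c, y) * log2( p(c, y) / (p(c) * p(y)) ), written with raw counts
        mi += (n_cy / n) * math.log2(n_cy * n / (cluster_counts[c] * concept_counts[y]))
    return mi


# Clusters that line up with concepts carry information about them;
# clusters that are independent of the concepts carry none.
aligned = mutual_information([0, 0, 1, 1], ["red", "red", "apple", "apple"])
independent = mutual_information([0, 0, 1, 1], ["red", "apple", "red", "apple"])
```

A higher value means the clusters are more predictive of the concepts. The paper's measure additionally constrains how the information may be decoded, which this sketch does not model.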
Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations
We present Neural Feature Fusion Fields (N3F), a method that improves dense
2D image feature extractors when the latter are applied to the analysis of
multiple images reconstructible as a 3D scene. Given an image feature
extractor, for example pre-trained using self-supervision, N3F uses it as a
teacher to learn a student network defined in 3D space. The 3D student network
is similar to a neural radiance field that distills said features and can be
trained with the usual differentiable rendering machinery. As a consequence,
N3F is readily applicable to most neural rendering formulations, including
vanilla NeRF and its extensions to complex dynamic scenes. We show that our
method not only enables semantic understanding in the context of scene-specific
neural fields without the use of manual labels, but also consistently improves
over the self-supervised 2D baselines. This is demonstrated by considering
various tasks, such as 2D object retrieval, 3D segmentation, and scene editing,
in diverse sequences, including long egocentric videos in the EPIC-KITCHENS
benchmark.
Comment: 3DV 2022, Oral. Project page: https://www.robots.ox.ac.uk/~vadim/n3f
2017 Robotic Instrument Segmentation Challenge
In mainstream computer vision and machine learning, public datasets such as
ImageNet, COCO and KITTI have helped drive enormous improvements by enabling
researchers to understand the strengths and limitations of different algorithms
via performance comparison. However, this type of approach has had limited
translation to problems in robotic assisted surgery as this field has never
established the same level of common datasets and benchmarking methods. In 2015
a sub-challenge was introduced at the EndoVis workshop where a set of robotic
images were provided with automatically generated annotations from robot
forward kinematics. However, there were issues with this dataset due to the
limited background variation, lack of complex motion and inaccuracies in the
annotation. In this work we present the results of the 2017 challenge on
robotic instrument segmentation which involved 10 teams participating in
binary, parts and type based segmentation of articulated da Vinci robotic
instruments.
Constant Velocity Constraints for Self-Supervised Monocular Depth Estimation
We present a new method for self-supervised monocular depth estimation. Contemporary monocular depth estimation methods use a triplet of consecutive video frames to estimate the central depth image. We make the assumption that the ego-centric view progresses linearly in the scene, based on the kinematic and physical properties of the camera. During the training phase, we can exploit this assumption to create a depth estimation for each image in the triplet. We then apply a new geometry constraint that supports novel synthetic views, thus providing a strong supervisory signal. Our contribution is simple to implement, requires no additional trainable parameter, and produces competitive results when compared with other state-of-the-art methods on the popular KITTI corpus.
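One possible reading of the linear-motion assumption is that the relative camera motion between the first two frames of a triplet repeats between the last two. The sketch below illustrates only that reading with 4x4 homogeneous transforms. It is not the authors' formulation: the helper names `make_pose` and `extrapolate_constant_velocity` are hypothetical, and the abstract does not specify the actual geometry constraint.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def extrapolate_constant_velocity(T_prev_to_cur):
    """Assumed constant-velocity model: the motion from frame t to t+1
    equals the motion from frame t-1 to t.

    Returns the assumed transform t -> t+1 and the composed transform t-1 -> t+1.
    """
    T_cur_to_next = T_prev_to_cur.copy()
    # Apply the first motion, then the (identical) second motion.
    T_prev_to_next = T_cur_to_next @ T_prev_to_cur
    return T_cur_to_next, T_prev_to_next


# Toy example: the camera moved 0.5 units forward along z between t-1 and t.
T_prev_to_cur = make_pose(np.eye(3), [0.0, 0.0, 0.5])
T_cur_to_next, T_prev_to_next = extrapolate_constant_velocity(T_prev_to_cur)
```

Under this assumption, a single estimated motion yields relative poses for all frame pairs in the triplet, which is what would allow a depth estimate for each image to be cross-checked by view synthesis.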
Deep spectral methods: a surprisingly strong baseline for unsupervised semantic segmentation and localization
Unsupervised localization and segmentation are long-standing computer vision challenges that involve decomposing an image into semantically meaningful segments without any labeled data. These tasks are particularly interesting in an unsupervised setting due to the difficulty and cost of obtaining dense image annotations, but existing unsupervised approaches struggle with complex scenes containing multiple objects. Differently from existing methods, which are purely based on deep learning, we take inspiration from traditional spectral segmentation methods by reframing image decomposition as a graph partitioning problem. Specifically, we examine the eigenvectors of the Laplacian of a feature affinity matrix from self-supervised networks. We find that these eigenvectors already decompose an image into meaningful segments, and can be readily used to localize objects in a scene. Furthermore, by clustering the features associated with these segments across a dataset, we can obtain well-delineated, nameable regions, i.e. semantic segmentations. Experiments on complex datasets (PASCAL VOC, MS-COCO) demonstrate that our simple spectral method outperforms the state-of-the-art in unsupervised localization and segmentation by a significant margin. Furthermore, our method can be readily used for a variety of complex image editing tasks, such as background removal and compositing. Project page: https://lukemelas.github.io/deep-spectral-segmentation
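The graph-partitioning idea in this abstract can be sketched in a few lines: build an affinity matrix from per-patch features, form its Laplacian, and split the patches by the sign of the eigenvector with the second-smallest eigenvalue. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `spectral_bipartition` is hypothetical, the affinities are clipped cosine similarities, the Laplacian is the unnormalized one, and the toy 2-D "features" stand in for features from a self-supervised network.

```python
import numpy as np


def spectral_bipartition(features):
    """Split n feature vectors into two groups using the Laplacian's second eigenvector.

    features: (n, d) array with non-zero rows (assumed; one row per image patch).
    Returns a boolean mask of length n marking one side of the partition.
    """
    features = np.asarray(features, dtype=float)
    # Cosine-similarity affinities, clipped so that all edge weights are non-negative.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, None)
    # Unnormalized graph Laplacian L = D - W.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    second_eigenvector = eigenvectors[:, 1]
    return second_eigenvector > 0


# Toy example: two patches pointing roughly along x, two roughly along y.
toy_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
mask = spectral_bipartition(toy_features)
```

Which side of the partition is marked `True` is arbitrary, because an eigenvector's sign is not determined. Only the grouping itself is meaningful.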